
  • Looping to import delimited a large .csv by parts with rowrange() 'forgets' variables

    Hello.
    I want to work with a .csv file that is more than 20 GB, so I cannot read it in the normal way. I was trying to import it in parts using rowrange(), filter what I need, save each part, and then append them.

    The issue is that at some point the import doesn't keep the variable names and instead names the variables after the first observation of that chunk, or something like that.
    I have tried different chunk sizes and different starting points for the loop, and this keeps coming up at different points in the data. Any thoughts on why this happens?
    (More than welcome to hear other possible approaches to this as well.)

    Here is my code:

    Code:
    local k = 1
    local j = 500000
    local h = 1

    while `k' < 15000000 {
        import delimited using "file.csv", rowrange(`k':`j') clear

        replace estado = "MEXICO" if estado == "MÉXICO"
        replace estado = "CDMX" if estado == "DISTRITO FEDERAL"
        keep if inlist(estado, "CDMX", "MEXICO", "HIDALGO")    // This drops around 70% of the data, on average

        save "file_chunk_`h'.dta", replace

        local k = `k' + 500000
        local j = `j' + 500000
        local h = `h' + 1
    }
    The size of each chunk (500,000) and the limit of the while loop (15 million) are arbitrary. I don't know how many observations I have (I believe around 65 million). As I said, I have tried different chunk sizes and different limits for the while loop. With these numbers, on the 15th iteration, when it pulls rows 7,000,001 to 7,500,000, it does so "successfully" but forgets the variable names, so when I try to replace estado it naturally issues "variable estado not found r(111);". I tried independently importing rows 6,900,000 to 7,100,000, then 7,100,000 to 7,200,000, and so on, and it works perfectly.
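
    One way to at least flag the offending chunk instead of stopping with r(111) would be a guard like this (a sketch using the same file and variable names as above; it diagnoses the bad chunk rather than fixing the underlying problem):

    Code:
    import delimited using "file.csv", rowrange(`k':`j') clear

    * check that the variable names survived before touching estado
    capture confirm variable estado
    if _rc {
        display as error "chunk `h' (rows `k'-`j'): estado not found, names were lost"
        describe, short
    }
    else {
        replace estado = "MEXICO" if estado == "MÉXICO"
        replace estado = "CDMX" if estado == "DISTRITO FEDERAL"
        keep if inlist(estado, "CDMX", "MEXICO", "HIDALGO")
        save "file_chunk_`h'.dta", replace
    }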

    Any suggestions?

    Thanks in advance.



    Carlos Aburto Castellanos
    Last edited by Carlos Aburto-Castellanos; 08 Sep 2020, 14:18.

  • #2
    What you describe does sound like a bug in -import delimited-, but I would nevertheless try your code on a much smaller file with appropriately adjusted k and j.

    As for a workaround: there is a community-contributed program, -chunky-, available from SSC, that will break a large text file into a bunch of smaller chunks. I wonder if you could use it, import each chunk, and then put them together with -append-. You would need to find a convenient way to insert the variable-name line at the beginning of each chunk, about which I don't have any quick ideas.
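
    The reassembly half of that idea might look like the following sketch (here `n' and the part`i'.csv names are placeholders for whatever the splitting step produces, with each piece assumed to carry its own variable-name line):

    Code:
    * sketch: n chunk files part1.csv ... partN.csv, each with a header row
    tempfile building
    forvalues i = 1/`n' {
        import delimited using "part`i'.csv", clear
        keep if inlist(estado, "CDMX", "MEXICO", "HIDALGO")
        if `i' == 1 {
            save "`building'"
        }
        else {
            append using "`building'"
            save "`building'", replace
        }
    }
    use "`building'", clear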



    • #3
      Hi, thanks very much for the reply, Mike.

      -chunky- works great. It is very simple and even before actually chunking the data,
      Code:
       chunky using ..., analyze
      is very useful for knowing what you are working with.

      Regarding the variable names, the option -header(include)- takes perfectly good care of that.
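
      For completeness, the split step with that option might look like this (chunk size and stub are arbitrary, and the exact filenames chunky writes should be checked afterwards):

      Code:
      * split once, copying the variable-name line into every piece
      chunky using "file.csv", chunksize(1 GB) header(include) stub("file_part")

      * spot-check one piece: names should now import correctly
      import delimited using "file_part0001.csv", clear    // adjust to chunky's actual numbering
      describe, short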

      Thanks a lot,

      C



      • #4
        I'm very new to Stata (self-taught) and trying to do a very similar thing. I'm able to use the chunky command; however, I need to filter out parts of the data depending on a single variable and then save those observations as chunks (my data set has 50 million+ values in it).

        I'm using this code for the chunky command:

        Code:
        chunky using "/Users/katelinphelps/Documents/ORIGINAL DATA SET.txt", chunksize(1 GB) header(include) stub("allvalues")

        And this code to filter the data I need:

        Code:
        keep if deedsitusstatestatic == "NY"

        But how do I combine those two commands into a single piece of code?
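
        A sketch of one way to combine them, looping over whatever chunk files -chunky- wrote (the allvalues* pattern and the output filename are assumptions; check the names it actually produced):

        Code:
        * run from the folder where the chunks were written
        local chunks : dir "." files "allvalues*"
        tempfile building
        local first = 1
        foreach f of local chunks {
            import delimited using "`f'", clear
            keep if deedsitusstatestatic == "NY"
            if `first' {
                save "`building'"
                local first = 0
            }
            else {
                append using "`building'"
                save "`building'", replace
            }
        }
        use "`building'", clear
        save "ny_values.dta", replace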
